Unit 2 - Subsetting DataFrames and Series


In [ ]:
import pandas as pd

In [ ]:
df = pd.read_csv('../data/airbnb_input.csv', index_col='room_id')

On df DataFrame you can find data about Airbnb rooms in Lisbon.

Go ahead and preview it!


In [ ]:
# code to preview df

Row selection

Selecting rows by index position - iloc

  • With function iloc you can select specific rows from a DataFrame.
  • In order to specify the rows you want to select you can use, for instance, the row position (integer starting from 0) or a list of row positions.

Here, we are selecting the first row and getting it as a pandas Series:


In [ ]:
df.iloc[0]

But you can also get a single row as a DataFrame:


In [ ]:
df.iloc[[0]]

Now try to get rows 0, 1 and 3, using iloc:


In [ ]:
# code to select rows 0, 1 and 3 from df

Array slices also work, so try to select all the rows from position number 2 to position number 7:


In [ ]:
# code to select rows [2, 7] from df

Selecting rows by index name - loc

  • With function loc you can select specific rows from a DataFrame, like with iloc.
  • The difference here is that you specify the rows to select using the rows' indexes.

Here, we're selecting the row that corresponds to room 17031 (with index 17031):


In [ ]:
df.loc[17031]

Now, if you select a row for an index that doesn't exist you'll get a KeyError:


In [ ]:
df.loc[[1]]

Try to select the rows that correspond to rooms 17031 and 25659:


In [ ]:
# code to select rooms 17031 and 25659

Now check what happens when you pass a boolean array to loc!


In [ ]:
# code to call loc with a boolean array

Column selection

Selecting columns by name - dot, brackets

You can select columns using dot notation, like this:


In [ ]:
df.room_type

Or using brackets, like this:


In [ ]:
df['room_type']

Now, try to select column room_type from df, but getting the result as a DataFrame instead of a Series:


In [ ]:
# code to get column room_type from df as a DataFrame

And now try to select columns room_type and neighborhood:


In [ ]:
# code to select columns room_type and neighborhood

If you're feeling confident, here's a little challenge: select column room_type using loc!


In [ ]:
# code to select column room_type using loc

Mask function

Mask function can be used to "hide" the rows that verify a certain condition:


In [ ]:
df.mask(df.overall_satisfaction == 5.0)

Where function

And with where function, you can hide the rows that don't verify a certain condition:


In [ ]:
df.where(df.overall_satisfaction == 5.0)

Filter data

  • You can also use conditions inside brackets to filter data from the DataFrame
  • When you do it you get a DataFrame that has a different shape from the initial one
  • While with mask/filter, the DataFame shape is the same, and you just get NaNs in the cells that verify/don't verify the conditions

Here we're selecting only the rows of rooms in the Alvalade neighborhood:


In [ ]:
df[df.neighborhood == 'Alvalade']

Selection based on data types

Check DataFrame's data types by using the class attribute dtypes:


In [ ]:
df.dtypes

In order to select columns based on their data type, use the function select_dtypes:


In [ ]:
df.select_dtypes(include=['float64'])

Writing data to files

To output your processed data to a file, pandas has many function to help you with that:

Check more here !

Here, we're exporting df to a csv file called airbnb_output.csv on the data directory of this unit:


In [ ]:
df.to_csv('../data/airbnb_output.csv')